ACCELERATOR ENGINE FOR PROCESSING FUNCTIONS USED IN AUDIO ALGORITHMS



BACKGROUND OF THE INVENTION 

1. Technical Field

This invention relates generally to audio digital signal processing (DSP) and more 
particularly to processing functions used in audio algorithms. 
2. Discussion of Background Art 

An Inverse Discrete Cosine Transformation (IDCT), which transforms data from
the frequency domain to the time domain, requires a pre-multiplication, an inverse Fast
Fourier Transformation (IFFT), and a post-multiplication. The IDCT is one of the last
stages of the Dolby® third-generation audio coding (AC3) decompression process.

Performing a 128-point IFFT requires 64*7 (=448) radix-2 butterflies, each
defined by

$A' = A - BC$ and $B' = A + BC$,

where $A$, $A'$, $B$, $B'$, and $C$ are complex numbers of the form $D = d_r + j d_i$. A subscript $r$
denotes the real part, and a subscript $i$ denotes the imaginary part of the complex number.
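
As an illustration only, the butterfly defined above can be written out in real and imaginary parts. The following C sketch is not taken from the specification; the type name `cplx` and the function name `butterfly` are assumptions chosen for the example, and floating-point values stand in for whatever fixed-point format the hardware uses.

```c
/* Complex value D = d_r + j*d_i, as in the text. */
typedef struct {
    double r;   /* real part */
    double i;   /* imaginary part */
} cplx;

/* One radix-2 butterfly, updating a and b in place:
 *   A' = A - B*C
 *   B' = A + B*C
 */
static void butterfly(cplx *a, cplx *b, cplx c)
{
    double tr = b->r * c.r - b->i * c.i;   /* real part of B*C */
    double ti = b->r * c.i + b->i * c.r;   /* imaginary part of B*C */
    cplx a_old = *a;                       /* keep A before overwriting it */

    a->r = a_old.r - tr;
    a->i = a_old.i - ti;
    b->r = a_old.r + tr;
    b->i = a_old.i + ti;
}
```

Each butterfly costs four real multiplications, so the 448 butterflies of a 128-point IFFT amount to 1792 real multiplications.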

FIG. 1 shows an audio integrated circuit chip 100 that includes a DSP 102, a
Random Access Memory (RAM) 106, a Read Only Memory (ROM) 110, and an
Input/Output (I/O) interface 114. DSP 102 provides the main digital signal processing for
chip 100 including, for example, filtering and transforming audio samples. RAM 106 is a
"memory on chip" used by programs running on DSP 102 to store data relating to input
audio samples and the processing performed on those samples. ROM 110 stores
additional data for DSP 102. I/O interface 114 implements various protocols for
exchanging, via databus 4005, audio samples with external devices such as
analog-to-digital (A/D) and digital-to-analog (D/A) converters, etc.

As audio AC3 and surround sound features are added to chip 100, DSP 102
cannot perform as fast as desired, that is, it cannot execute as many million instructions
per second (MIPs) as are required to do all of the tasks demanded by chip 100, including,
for example, receiving AC3 data, detecting AC3 data errors, using IDCT to decode data,
and performing audio enhancement and 3D features. One approach to improving DSP
102 performance, or to conserving DSP 102 MIPs for other desired functions, is to accelerate
the AC3 decompression stage. However, this approach requires operations that are tightly
coupled with the AC3 algorithm, which in turn requires that a designer be intimately
familiar with the AC3 decoding process. Further, the approach is useful for accelerating
AC3, but provides no benefit for other audio functions or Moving Picture Experts Group
(MPEG) functions.

Therefore, what is needed is a mechanism to efficiently improve the performance of
DSP 102, and thereby the performance of chip 100.

SUMMARY OF THE INVENTION

The present invention provides an accelerator engine running in parallel with a
DSP in an audio chip to improve the chip's performance. The engine avoids AC3-specific
algorithms and focuses on general purpose DSP functions, including biquad filtering and
IDCT, the latter comprising pre-multiplication, IFFT, and post-multiplication. The DSP is
therefore free to do other processing while the engine is performing a requested function.
The engine utilizes its resources in a pipeline structure for increased efficiency. In the
biquad and double precision biquad modes the engine stores data in predefined locations in
memory to efficiently access the data. The engine also uses an equation and data values
stored in the predefined locations to calculate an audio sample. The calculated result is
then stored in a memory location in such a way that it can easily be used to calculate the next
sample. In a preferred embodiment the engine saves 15 MIPs in an AC3-based
3D audio product, including 7.5 MIPs in the AC3 decoding process by accelerating the
IFFT, and 7.5 MIPs from the 3D processing via a biquad filtering function.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a prior art DSP chip;
FIG. 2 shows a chip utilizing the accelerator engine according to the invention;
FIG. 3 is a block diagram of the accelerator engine;
FIG. 4A shows a pipeline structure for the pre-multiplication mode;
FIG. 4B shows a resource utilization map for the pre-multiplication mode;
FIG. 5A shows a pipeline structure for the FFT mode;
FIG. 5B shows a resource utilization map for the FFT mode;
FIG. 5C shows C code describing the address generation for both 128-point and 64-point
FFTs;
FIG. 6A shows a pipeline structure for the biquad filtering mode;
FIG. 6B shows a resource utilization map for the biquad filtering mode;
FIG. 6C is a flowchart illustrating how the accelerator engine processes the biquad
filtering;
FIG. 6D is a flowchart illustrating how the accelerator engine stores data during the
biquad filtering mode;
FIG. 6E illustrates data in memory during a biquad filtering mode;
FIG. 7A shows a pipeline structure for the double precision biquad filtering mode;
FIG. 7B shows a resource utilization map for the double precision biquad filtering mode;
FIG. 7C illustrates data in memory during a double precision biquad filtering mode; and
FIG. 8 is a flowchart illustrating how a chip requests that a function be performed by the
accelerator engine.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention provides an accelerator engine running in parallel with a
DSP for processing functions that are usable by audio algorithms including, for example,
AC3, 3D, bass management, MP3, etc., and that would otherwise be performed by the
DSP. Consequently, the DSP is free to process other tasks.

FIG. 2 shows a chip 250 utilizing the invention. Integrated circuit chip 250 is like
chip 100 except that chip 250 includes an accelerator engine 200 interfacing via databus
2003 with DSP 102. Data passing through databus 2003 includes configuration
information and data for accelerator engine 200 to perform a requested function.

FIG. 3 is a block diagram of accelerator engine 200 preferably including an RRAM
304, an IRAM 308, a ROM 312, a control register 316, a state machine 320, a multiplier
(MPY) 326, a shift/sign extender 328, an ALU 330, and other components. In the
preferred embodiment accelerator engine 200 supports the following functions: single
biquad filtering; double precision biquad filtering; radix-2, seven-pass, 128-point IFFT;
radix-2, six-pass, 64-point IFFT; pre-multiplication in 128-word and 64-word configurations;
post-multiplication in 128-word and 64-word configurations; IDCT in 128-word and 64-word
configurations; and BFE-RAM read/write.

RRAM 304, depending on the required function, stores different types of values.
For example, for a biquad filtering function, RRAM 304 stores filter coefficients. For
IFFT and IDCT functions, RRAM 304 stores real parts and IRAM 308 stores imaginary
parts of complex numbers required by these functions. Data in RRAM 304 and IRAM
308, where appropriate, contains new values each time accelerator engine 200 has
calculated an audio sample.

ROM 312 is used in IFFT and IDCT modes, preferably to store both real and
imaginary values for complex numbers required by IFFT and IDCT functions.

State machine 320 generates addresses for RRAM 304, IRAM 308, and ROM 312
and control signals for other circuitry of accelerator engine 200.

MPY 326 is preferably a conventional 24-bit multiplier for multiplying data on
lines 3031 and 3035.

Shifter/Sign-Extender 328 shifts its contents to adjust or mask a sign bit on lines
3041 and 3045.

ALU 330 is preferably a conventional ALU which accumulates, adds, or subtracts
data on lines 3051 and 3055. ALU 330 can be divided into an ALUA (not shown) and an
ALUB (not shown) for use when two ALUs are required, such as in the
pre-multiplication, IFFT, and post-multiplication modes.

Multiplexers (MUXes) M01, M03, M05, M07, M09, M11, and M13 perform
conventional functions of MUXes, that is, passing selected inputs based on select signals
(not shown) provided by state machine 320 and appropriate circuitry.

Latches L01, L03, L05, L07, L08, L09, L11, L13, L15, and L17 perform
conventional functions of latches including passing inputs based on clocking signals (not
shown) provided by state machine 320 and appropriate circuitry.

This specification uses the following notations in several tables for a pipeline
structure and for a resource utilization map. Each column represents a critical resource
(RRAM 304, IRAM 308, MPY 326, etc.). Each row represents a phase-one-to-phase-one
clock cycle, which lasts 20 ns in a preferred embodiment. The tables show what each
resource is doing during each cycle. In a resource utilization map, the number in each
entry represents the number of operations a resource is executing during a given cycle. A
"0" indicates a resource is idle; a "1" indicates a resource is busy.

In the preferred embodiment the pre- and post-multiplication modes pass data and
multiply each data item by a unique complex constant preferably stored in ROM 312, that
is,

$A_n = A_n C_n$, where
$A_n = a_{rn} + j a_{in}$ and
$C_n = c_{rn} + j c_{in}$.

Parameters $a_{rn}$ and $a_{in}$ represent data in RRAM 304 and IRAM 308, respectively,
and $C_n$ are filter coefficients in ROM 312. In the preferred embodiment, $n$ ranges from 0
to 127 (for 128 data points). The pre- and post-multiplication modes are identical except
that data is accessed linearly in the pre-multiplication mode and in bit-reverse order in the
post-multiplication mode because the IFFT mode leaves its results in bit-reverse order.
Coefficients are accessed linearly in both the pre- and the post-multiplication modes.
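
The difference between the two access orders can be sketched in C as follows. This is an illustrative model only, not the engine's implementation: the function names `bit_reverse` and `multiply_pass`, the use of `double` values, and the `post` flag are assumptions introduced for the example.

```c
/* Reverse the lowest `bits` bits of n (bits = 7 for 128 points, 6 for 64). */
static unsigned bit_reverse(unsigned n, unsigned bits)
{
    unsigned r = 0;
    for (unsigned k = 0; k < bits; k++) {
        r = (r << 1) | (n & 1u);
        n >>= 1;
    }
    return r;
}

/* Multiply each data item by its own constant C_n, in place.
 * re/im model the data in RRAM and IRAM; cr/ci model the constants in ROM.
 * post != 0 selects the bit-reversed data order of the post-multiplication
 * mode; the coefficient index n always advances linearly. */
static void multiply_pass(double *re, double *im,
                          const double *cr, const double *ci,
                          unsigned bits, int post)
{
    unsigned count = 1u << bits;
    for (unsigned n = 0; n < count; n++) {
        unsigned d = post ? bit_reverse(n, bits) : n;   /* data address */
        double r = re[d] * cr[n] - im[d] * ci[n];
        double i = re[d] * ci[n] + im[d] * cr[n];
        re[d] = r;
        im[d] = i;
    }
}
```

With `bits = 7`, data address 1 pairs with coefficient 64 in the post-multiplication order, because the result that belongs to index 64 was left at address 1 by the IFFT.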
FIG. 4A shows a pipeline structure for the pre-multiplication mode. In cycle 1
accelerator engine 200 reads $b_r$ from RRAM 304, $b_i$ from IRAM 308, and $c_r$ from ROM
312. In cycle 2 accelerator engine 200 reads $c_i$ from ROM 312. In cycles 3 through 6
MPY 326 performs $b_r * c_r$, $b_i * c_i$, $b_r * c_i$, and $b_i * c_r$, respectively. Accelerator engine
200, instead of using the same $c_r$ and $c_i$ that were read from ROM 312 in cycles 1 and 2,
rereads $c_i$ and $c_r$ from ROM 312 in cycles 3 and 4. Rereading these values in cycles 3 and
4 avoids using a register to store the values read in cycles 1 and 2. Further, ROM 312 is
available for accessing its data in cycles 3 and 4. ALUA in cycles 6 and 7 performs
$A = b_r * c_r + 0$ and $A_o = A - (b_i * c_i)$, respectively. ALUB in cycles 8 and 9 performs
$B = b_r * c_i$ and $B_o = B + (b_i * c_r)$, respectively. In cycle 10 accelerator engine 200 writes
$b_r$ and $b_i$ into RRAM 304 and IRAM 308, respectively. In this pipeline structure, data required for
a function in each cycle is made available before the data is needed. For example, $b_r$ and $c_r$
used by MPY 326 in cycle 3 have been made ready to MPY 326 by reading RRAM 304
and ROM 312 in cycle 1.
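
The arithmetic carried by this pipeline (four multiplications followed by two accumulations on each ALU half) can be sketched as below. The sketch only mirrors the order of operations named in the text; the function name `premultiply_item`, the integer types, and the absence of any fixed-point rescaling are assumptions, since the specification does not state the number format beyond a 24-bit multiplier and a 48-bit ALU.

```c
#include <stdint.h>

/* One pre-multiplication item: (b_r + j*b_i) * (c_r + j*c_i).
 * Inputs model 24-bit operands; int64_t holds the wider ALU results. */
static void premultiply_item(int32_t br, int32_t bi, int32_t cr, int32_t ci,
                             int64_t *out_r, int64_t *out_i)
{
    /* MPY 326: four products, one per cycle. */
    int64_t m1 = (int64_t)br * cr;
    int64_t m2 = (int64_t)bi * ci;
    int64_t m3 = (int64_t)br * ci;
    int64_t m4 = (int64_t)bi * cr;

    /* ALUA: A = b_r*c_r + 0, then A_o = A - b_i*c_i (real part). */
    int64_t a = m1 + 0;
    int64_t a_o = a - m2;

    /* ALUB: B = b_r*c_i, then B_o = B + b_i*c_r (imaginary part). */
    int64_t b = m3;
    int64_t b_o = b + m4;

    *out_r = a_o;
    *out_i = b_o;
}
```

For example, multiplying 3 + 4j by 5 + 6j yields -9 + 38j.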

FIG. 4B shows a resource utilization map for the pre-multiplication mode, which,
starting on any four-cycle boundary and taking four cycles, shows the number of
operations a resource is executing. For example, MPY 326 and ROM 312 perform four
operations (all four 1's), one in each of the four selected cycles 1 to 4. ALUA performs
two operations (one in each of cycles 2 and 3) while ALUB performs two operations (one
in each of cycles 1 and 4). Data is accessed in cycles 1 and 2 for RRAM 304 and IRAM
308. Therefore, MPY 326 and ROM 312 are utilized 100% of the time (4 operations in
4 cycles) while each of ALUA, ALUB, RRAM 304, and IRAM 308 is utilized 50% of
the time (2 operations in 4 cycles).

Accelerator engine 200 preferably uses seven passes of a radix-2 butterfly to
process IFFTs, and employs the following equations:

$A' = A - BC$ and $B' = A + BC$,

where $A$, $A'$, $B$, $B'$, and $C$ are complex numbers.

FIG. 5A shows a pipeline structure for the IFFT mode and FIG. 5B shows a
resource utilization map for the IFFT mode. The explanations of this IFFT pipeline
structure and resource utilization map are similar to the respective explanations of the
pre-multiplication mode in FIGs. 4A and 4B. For example, in cycle 1 accelerator engine 200
reads $b_r$ and $c_r$ from RRAM 304 and ROM 312, respectively; MPY 326 is utilized 100% of
the time because in the selected four cycles MPY 326 performs four operations; etc.
Consequently, as shown in FIG. 5B, all resources are utilized 100% of the time.
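
A rough timing estimate follows from these figures. The sketch below assumes that, in steady state, one butterfly completes every four cycles (each butterfly needs four real multiplications and MPY 326 is fully utilized), that pipeline fill and drain time can be ignored, and that a cycle lasts 20 ns as in the preferred embodiment. The function name `ifft_time_ns` is hypothetical, and the result is an estimate derived from the text, not a figure stated in it.

```c
/* Steady-state IFFT time estimate in nanoseconds.
 * points: transform length (128 or 64); passes: radix-2 passes (7 or 6). */
static unsigned long ifft_time_ns(unsigned points, unsigned passes)
{
    const unsigned long cycles_per_butterfly = 4;   /* assumption, see lead-in */
    const unsigned long cycle_ns = 20;              /* preferred embodiment */
    unsigned long butterflies = (unsigned long)(points / 2) * passes;

    return butterflies * cycles_per_butterfly * cycle_ns;
}
```

Under these assumptions a 128-point IFFT (448 butterflies) takes about 35.84 microseconds and a 64-point IFFT (192 butterflies) about 15.36 microseconds.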

Accelerator engine 200 performs each pass of the IFFT mode sequentially in both
128-point IFFT and 64-point IFFT. The passes differ only in the method by
which data points are addressed. The addressing schemes for the two modes are similar.
FIG. 5C lists C code describing the address generation for both 128-point FFT and
64-point FFT modes.
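
The loop structure of FIG. 5C can be exercised on its own by recording the address pairs instead of performing the arithmetic. The following sketch uses array indices rather than pointers; the function name `walk_addresses` and the output arrays are assumptions made for the example. The caller must supply arrays with room for (length/2)*log2(length) entries.

```c
/* Record the A and B data addresses of every butterfly, pass by pass,
 * and return the number of butterflies visited. */
static unsigned walk_addresses(unsigned fft_length,
                               unsigned *a_idx, unsigned *b_idx)
{
    unsigned passes = 0;
    for (unsigned n = fft_length; n > 1; n >>= 1)
        passes++;                          /* log2(fft_length): 7 or 6 */

    unsigned group = 1;
    unsigned block = fft_length / 2;       /* 64 or 32 */
    unsigned count = 0;

    for (unsigned i = 0; i < passes; i++) {
        unsigned a = 0;                    /* A address */
        unsigned b = block;                /* B address */
        for (unsigned j = 0; j < group; j++) {
            for (unsigned k = 0; k < block; k++) {
                a_idx[count] = a++;
                b_idx[count] = b++;
                count++;
            }
            a += block;                    /* skip to the next group */
            b += block;
        }
        block >>= 1;                       /* half the block size ... */
        group <<= 1;                       /* ... twice as many groups */
    }
    return count;
}
```

For an 8-point transform the first pass pairs addresses (0,4), (1,5), (2,6), (3,7); the second pairs (0,2), (1,3), (4,6), (5,7); and the third pairs (0,1), (2,3), (4,5), (6,7).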

In the biquad filtering mode, accelerator engine 200 uses the equation:

$y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 y_{n-1} + a_2 y_{n-2}$    (1)

where the subscript $n$ indicates the current sample number; $x_n$ is the current input sample
and $y_n$ is the current output sample; and $b_0$, $b_1$, $b_2$, $a_1$, and $a_2$ are filter coefficients.
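
Equation (1) can be sketched directly in C as shown below. The type and function names are assumptions for illustration, and `double` stands in for the engine's fixed-point arithmetic. Note that equation (1) adds the feedback terms, so the signs of $a_1$ and $a_2$ follow that convention rather than the subtractive form found in some textbooks.

```c
/* Coefficients and delayed samples for equation (1). */
typedef struct { double b0, b1, b2, a1, a2; } biquad_coeffs;
typedef struct { double x1, x2, y1, y2; } biquad_state;   /* x[n-1], x[n-2], y[n-1], y[n-2] */

/* Compute one output sample y[n] from input x[n] and update the state. */
static double biquad_step(const biquad_coeffs *c, biquad_state *s, double x)
{
    double y = c->b0 * x + c->b1 * s->x1 + c->b2 * s->x2
             + c->a1 * s->y1 + c->a2 * s->y2;

    s->x2 = s->x1;   /* shift the input history */
    s->x1 = x;
    s->y2 = s->y1;   /* shift the output history */
    s->y1 = y;
    return y;
}
```

Starting from an all-zero state corresponds to using zero for $y_{-2}$, $y_{-1}$, $x_{-2}$, and $x_{-1}$.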

Accelerator engine 200 stores filter coefficients in RRAM 304 and input samples
and filter states in IRAM 308. In the biquad filtering mode, 48-bit ALU 330 remains as
one ALU (instead of being divided into two ALUs: ALUA and ALUB).

FIG. 6A shows a pipeline structure for the biquad filtering mode.

FIG. 6B shows a resource utilization map for the biquad filtering mode. This
pipeline structure can be repeated every six cycles and all resources will be used five out
of every six cycles.

FIG. 6C is a flowchart illustrating a method for accelerator engine 200 to perform
biquad filtering in accordance with the invention. In step 604, accelerator engine 200
receives, for example, 124 sample data points represented by $x_0$ to $x_{123}$. In step 608,
accelerator engine 200 stores data in IRAM 308. In step 612, accelerator engine 200 uses
equation (1) to calculate $y_n$ for n=0 to n=123, that is, $y_0$ to $y_{123}$, and stores them in
appropriate locations in IRAM 308. Accelerator engine 200 then continues to receive,
store, and calculate sampled data in respective steps 604, 608, and 612 until all data has
been received. Then accelerator engine 200 completes the biquad filtering function in step
620.

FIG. 6D is a flowchart illustrating how accelerator engine 200, in accordance with
steps 608 and 612 of FIG. 6C, calculates and stores $y_n$ in IRAM 308 locations for 124
samples of $x_n$ from $x_0$ to $x_{123}$. In step 604D accelerator engine 200 stores $x_0$ to $x_{123}$ in
location 4 to location 127. Accelerator engine 200 also stores values of $y_{-2}$, $y_{-1}$, $x_{-2}$, and
$x_{-1}$ in locations 0, 1, 2, and 3, respectively. In FIG. 6D, locations 0 to 127 are used for
illustrative purposes only; any 128 locations, for example, 1 to 128, 2 to 129, or K to K +
128 - 1, are applicable. In step 608D accelerator engine 200 uses the values in locations 0,
1, 2, 3, and 4 to calculate $y_0$. In step 612D accelerator engine 200 stores the value of $y_0$ in
location 2. Accelerator engine 200 then returns to step 608D to calculate $y_1$ and store $y_1$
in location 3, which is one location higher than location 2 storing $y_0$. Accelerator engine
200 keeps calculating and storing values of $y$ until it is done, that is,
accelerator engine 200 calculates and stores values of $y_2$ to $y_{123}$ in location 4 through
location 125, respectively. Accelerator engine 200, in calculating $y_0$ to $y_{123}$, uses values of
$a_2$, $a_1$, $b_2$, $b_1$, and $b_0$ preferably stored in ROM 312. Those skilled in the art will recognize
that calculating $y_0$ (n=0), based on equation (1), requires $y_{-2}$, $y_{-1}$, $x_{-2}$, $x_{-1}$, and $x_0$. The
invention uses the zero value for each of $y_{-2}$, $y_{-1}$, $x_{-2}$, and $x_{-1}$ to calculate the first sequence
of 124 $x_n$ samples.
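
The storage scheme of FIG. 6D can be modelled with a plain array standing in for the 128 IRAM locations. In the sketch below the function name `biquad_block_in_place`, the macro names, and the use of `double` are assumptions made for illustration. The coefficient array is ordered $a_2$, $a_1$, $b_2$, $b_1$, $b_0$ so that it lines up with the five-location window $y_{n-2}$, $y_{n-1}$, $x_{n-2}$, $x_{n-1}$, $x_n$.

```c
#define BLOCK 124            /* samples per sequence, as in the text */
#define MEM   (BLOCK + 4)    /* 128 locations: 4 history values + 124 samples */

/* On entry: mem[0..3] = y[-2], y[-1], x[-2], x[-1]; mem[4..127] = x[0..123].
 * On exit:  mem[2..125] = y[0..123]; mem[126..127] = x[122], x[123].
 * coef[] = { a2, a1, b2, b1, b0 }. */
static void biquad_block_in_place(double mem[MEM], const double coef[5])
{
    for (int n = 0; n < BLOCK; n++) {
        const double *w = &mem[n];   /* window: locations n .. n+4 */
        double y = 0.0;

        for (int k = 0; k < 5; k++)
            y += coef[k] * w[k];     /* equation (1) as a 5-term dot product */

        mem[n + 2] = y;              /* overwrite x[n-2], the oldest x value */
    }
}
```

Because each result replaces the one value that the next window no longer needs, the window simply slides up by one location per sample and no data has to be shuffled between samples.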

FIG. 6E illustrates how IRAM 308 stores a data value for each $y_n$ from $y_0$ to $y_{123}$.
The "Address" column shows locations from 0 to 127 in IRAM 308. The "Initial Data"
column shows data of $y_{-2}$, $y_{-1}$, $x_{-2}$, $x_{-1}$, and $x_0$ to $x_{123}$ in corresponding locations of the
"Address" column. Columns n=0, n=1, n=2, n=3, ... to n=123 show data in IRAM
308 locations for $y_n$ for n=0 to n=123, respectively. Box 655 includes values (of $y_{-2}$, $y_{-1}$,
$x_{-2}$, $x_{-1}$, and $x_0$) that are used to calculate $y_0$. Similarly, boxes 659, 663, and 669 include
values ($y_{-1}$, $y_0$, $x_{-1}$, etc.) that are used to calculate $y_1$, $y_2$, and $y_3$, respectively. According to
the invention, IRAM 308 locations of values in boxes 655, 659, 663, 669, etc., are
increased by one for each increment of n. For example, box 655 includes values in
locations 0 through 4, box 659 includes values in locations 1 through 5, box 663 includes
values in locations 2 through 6, and box 669 includes values in locations 3 through 7, etc.
Arrow 602 indicates that $y_1$ is stored in location three, which is one location higher than
the location of $y_0$. Similarly, arrow 604 indicates that $y_2$ is stored in location four, one
location higher than the location of $y_1$. Column n=123 shows that $y_0$ to $y_{123}$ are stored in
locations 2 to 125, respectively. Consequently, the invention, while writing the result of $y_n$
(e.g., $y_0$ in column n=0) over the oldest x value (e.g., $x_{-2}$ in the "Initial Data" column),
permits the data for calculating $y_{n+1}$ (e.g., $y_1$) to appear perfectly ordered in the subsequent
five locations (e.g., location 1 through location 5).

In accordance with the invention, calculating and storing $y_n$ for subsequent sequences of
124 samples of $x$, that is, calculating and storing $y_n$ for n=124 to n=247, for n=248 to
n=371, and for n=372 to n=495, etc., is similar to calculating and storing $y_n$ for n=0 to
n=123. The invention thus uses the same IRAM 308 locations from location 0 to location 127
for calculating and storing $y_n$ for subsequent sequences of 124 samples of $x$. As discussed
above, the invention uses the zero value for $y_{-2}$, $y_{-1}$, $x_{-2}$, and $x_{-1}$ for the first sequence of
124 samples of $x$. For a second sequence of 124 samples of $x$ the invention uses $y_{122}$, $y_{123}$,
$x_{122}$, and $x_{123}$ for $y_{-2}$, $y_{-1}$, $x_{-2}$, and $x_{-1}$, respectively. Similarly, for a third sequence of
124 samples of $x$ the invention uses $y_{246}$, $y_{247}$, $x_{246}$, and $x_{247}$ for $y_{-2}$, $y_{-1}$, $x_{-2}$, and $x_{-1}$,
respectively. In calculating $y_n$, accelerator engine 200 stores filter coefficients preferably in
RRAM 304 in order of $a_2$, $a_1$, $b_2$, $b_1$, and $b_0$.
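
The hand-over between consecutive 124-sample sequences can be sketched as follows. After one sequence, locations 124 through 127 hold the two newest outputs and the two newest inputs, which become the history values in locations 0 through 3 for the next sequence. The function name `biquad_stream`, the local array standing in for IRAM 308, and the copy of results to a separate output buffer are assumptions made for this illustration; how the chip actually moves data between sequences is not detailed in the text.

```c
#define BLOCK 124            /* samples per sequence */
#define MEM   (BLOCK + 4)    /* 128 locations, standing in for IRAM 308 */

/* Filter `blocks` consecutive sequences of BLOCK samples from x into y.
 * coef[] = { a2, a1, b2, b1, b0 }. */
static void biquad_stream(const double *x, double *y, int blocks,
                          const double coef[5])
{
    double mem[MEM] = {0};   /* first sequence: all four history values are zero */

    for (int blk = 0; blk < blocks; blk++) {
        if (blk > 0) {
            /* Carry y[last-1], y[last], x[last-1], x[last] of the previous
             * sequence (locations 124..127) down to locations 0..3. */
            for (int k = 0; k < 4; k++)
                mem[k] = mem[BLOCK + k];
        }
        for (int n = 0; n < BLOCK; n++)       /* new samples go to 4..127 */
            mem[n + 4] = x[blk * BLOCK + n];

        for (int n = 0; n < BLOCK; n++) {     /* in-place kernel, window n..n+4 */
            double acc = 0.0;
            for (int k = 0; k < 5; k++)
                acc += coef[k] * mem[n + k];
            mem[n + 2] = acc;
        }
        for (int n = 0; n < BLOCK; n++)       /* results sit in 2..125 */
            y[blk * BLOCK + n] = mem[n + 2];
    }
}
```

With this hand-over the output at n=124 depends on $y_{122}$, $y_{123}$, $x_{122}$, and $x_{123}$ exactly as if the whole input had been filtered in a single run.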

The double precision biquad mode is similar to the (single) biquad mode, but the
feedback state is stored and calculated as double precision for greater numerical stability.
The equation for the double precision biquad mode is

$y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 yl_{n-1} + a_2 yl_{n-2} + a_1 yh_{n-1} + a_2 yh_{n-2}$    (2)

in which $yl_n$ represents the lower half bits and $yh_n$ represents the upper half bits of $y_n$. In
the preferred embodiment, a double precision term $y_n$ comprises 48 bits, and therefore
each term $yl_n$ and $yh_n$ comprises 24 bits.
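
How a 48-bit value might be split into two 24-bit halves, and how the two feedback products recombine, can be sketched as follows. This is an assumption-laden illustration: the specification does not state the fixed-point format, so the sketch treats $y$ as a two's-complement integer with $y = yh \cdot 2^{24} + yl$, and the weighting of the $yh$ product by $2^{24}$ is an assumption of the example rather than something equation (2) spells out. The names `split48` and `feedback_term` are hypothetical.

```c
#include <stdint.h>

#define HALF_BITS  24
#define HALF_SCALE ((int64_t)1 << HALF_BITS)   /* 2^24 */

/* Split y into an upper half yh (signed) and a lower half yl (0 .. 2^24 - 1)
 * such that y == yh * 2^24 + yl. */
static void split48(int64_t y, int64_t *yh, int64_t *yl)
{
    *yl = (int64_t)((uint64_t)y & (uint64_t)(HALF_SCALE - 1));
    *yh = (y - *yl) / HALF_SCALE;   /* exact: y - yl is a multiple of 2^24 */
}

/* Feedback contribution a*y formed from two half-width products. */
static int64_t feedback_term(int64_t a, int64_t yh, int64_t yl)
{
    return a * yh * HALF_SCALE + a * yl;
}
```

Under this model, `feedback_term(a, yh, yl)` equals `a * y`, which is why two products per feedback coefficient (one with $yl$, one with $yh$) are sufficient to apply the coefficient to the full 48-bit state.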

FIG. 7A shows a pipeline structure for the double precision biquad mode.

FIG. 7B shows a resource utilization map for the double precision biquad mode. In
this mode accelerator engine 200 operates in a pipeline fashion that repeats every nine
cycles.

The method for calculating $y_n$ in the double precision mode is similar to that for
calculating in the biquad mode except that, instead of five values ($x_n$, $x_{n-1}$, $x_{n-2}$, $y_{n-1}$, and $y_{n-2}$),
accelerator engine 200 uses seven values ($x_n$, $x_{n-1}$, $x_{n-2}$, $yl_{n-1}$, $yl_{n-2}$, $yh_{n-1}$, and $yh_{n-2}$), as
required by equation (2), to calculate each $y_n$. Further, after calculating $y_0$, the invention
stores $yh_0$ and $yl_0$ in respective locations 2 and 4. Similarly, after calculating $y_1$, the
invention stores $yh_1$ and $yl_1$ in respective locations 3 and 5, each location being one higher
than the respective locations of $yh_0$ and $yl_0$. Thus for each $y_n$, the invention stores $yh_n$ and
$yl_n$ each in one location higher than the locations of respective $yh_{n-1}$ and $yl_{n-1}$.

FIG. 7C shows how accelerator engine 200 stores calculating and calculated
values in IRAM 308 for the double precision biquad mode. The "Address" column shows
locations from 0 to 127 in IRAM 308. The "Initial Data" column shows data of $yh_{-2}$, $yh_{-1}$,
$yl_{-2}$, $yl_{-1}$, $x_{-2}$, $x_{-1}$, and $x_0$ to $x_{121}$ in corresponding locations of the "Address" column.
Columns 0, 1, 2, 3, ... to 121 show data in IRAM 308 locations for $x_n$ and $y_n$ for n=0 to
n=121, respectively. Box 755 includes values (of $yh_{-2}$, $yh_{-1}$, $yl_{-2}$, $yl_{-1}$, $x_{-2}$, $x_{-1}$, and $x_0$) that
are used to calculate $y_0$. Similarly, boxes 759, 763, and 769 include values ($yh_{-1}$, $yh_0$, $yl_{-1}$,
$yl_0$, $x_{-1}$, etc.) that are used to calculate $y_1$, $y_2$, and $y_3$, respectively. According to the
invention, IRAM 308 locations of values in boxes 759, 763, 769, etc., are increased by
one for each increment of n. For example, box 755 includes values in locations 0 through
6, box 759 includes values in locations 1 through 7, box 763 includes values in locations 2
through 8, and box 769 includes values in locations 3 through 9, etc. Arrow 702 indicates
that $yh_1$ is stored in location 3, which is one location higher than the location of $yh_0$.
Arrow 703 indicates that $yl_1$ is stored in location 5, which is one location higher than the
location of $yl_0$. Similarly, arrows 704 and 705 indicate that $yh_2$ and $yl_2$ are each stored in
locations one higher than the respective locations of $yh_1$ and $yl_1$, etc. Column n=121
shows that $yh_0$ to $yh_{121}$ are stored in respective locations 2 to 123, while $yl_{120}$ and $yl_{121}$ are
stored in respective locations 124 and 125, and $x_{120}$ and $x_{121}$ are stored in respective
locations 126 and 127. Consequently, the invention, while writing the result of $yh_n$ and $yl_n$
(e.g., $yh_0$ and $yl_0$ in column n=0) over the oldest values of $yl$ and $x$ (e.g., $yl_{-2}$ and $x_{-2}$ in the
"Initial Data" column), permits the data for calculating $y_{n+1}$ (e.g., $y_1$) to appear perfectly
ordered in the subsequent seven locations (e.g., location one through location seven).

As in the biquad filtering mode, calculating and storing $y_n$ for subsequent sequences of 122
samples of $x$, that is, calculating and storing $y_n$ for n=122 to n=243, for n=244 to n=365,
and for n=366 to n=487, etc., is similar to calculating and storing $y_n$ for n=0 to n=121.
The invention thus uses the same IRAM 308 locations from location 0 to location 127 for
calculating and storing $y_n$ for subsequent sequences of 122 samples of $x$. As discussed
above, the invention uses the zero value for $yh_{-2}$, $yh_{-1}$, $yl_{-2}$, $yl_{-1}$, $x_{-2}$, and $x_{-1}$ for the first
sequence of 122 samples of $x$. For a second sequence of 122 samples of $x$ the invention
uses $yh_{120}$, $yh_{121}$, $yl_{120}$, $yl_{121}$, $x_{120}$, and $x_{121}$ for $yh_{-2}$, $yh_{-1}$, $yl_{-2}$, $yl_{-1}$, $x_{-2}$, and $x_{-1}$, respectively.
Similarly, for a third sequence of 122 samples of $x$ the invention uses $yh_{242}$, $yh_{243}$, $yl_{242}$,
$yl_{243}$, $x_{242}$, and $x_{243}$ for $yh_{-2}$, $yh_{-1}$, $yl_{-2}$, $yl_{-1}$, $x_{-2}$, and $x_{-1}$, respectively. In calculating $y_n$,
accelerator engine 200 stores filter coefficients preferably in RRAM 304 in order of $a_2$, $a_1$,
$a_2$, $a_1$, $b_2$, $b_1$, $b_0$.

FIG. 8 is a flowchart illustrating how chip 100 invokes a function performed by
accelerator engine 200. In step 804 chip 100 writes to configuration register 316 to set
the required mode and halt accelerator engine 200. In step 808 chip 100 downloads
required data from chip 100 to accelerator engine 200. For the pre-multiplication, IFFT,
or post-multiplication modes chip 100 downloads data preferably in the order of 0 to 127
and alternating between RRAM 304 and IRAM 308. For the biquad mode, chip 100
downloads coefficients preferably in the order of $a_2$, $a_1$, $b_2$, $b_1$, and $b_0$ in RRAM 304.
Similarly, for the double precision biquad mode, chip 100 downloads coefficients
preferably in the order of $a_2$, $a_1$, $a_2$, $a_1$, $b_2$, $b_1$, and $b_0$. Chip 100 also downloads data in the
order from 0 to 127, which is the order shown in the "Initial Data" column in FIG. 6E. In
step 812 chip 100 determines whether all of the data has been downloaded. If the data is
not completely downloaded then chip 100 in step 808 keeps downloading data, but if data
is completely downloaded then chip 100 in step 816 sets the run bit in configuration
register 316 so that accelerator engine 200 in step 820 can perform the requested function.
Chip 100 in step 824 monitors the status of the done bit in configuration register 316 to
determine whether accelerator engine 200 has completed its requested task. In step 828
accelerator engine 200 completes its requested task, and, depending on the mode, chip
100 may or may not set the done bit in configuration register 316. For example, if the
requested task is a stand-alone pre-multiplication, then chip 100 sets the done bit, but if
the task is an IDCT function then chip 100 does not set the done bit because accelerator
engine 200 would continue to perform the IFFT function after completing the
pre-multiplication function. In step 832 chip 100, via bus 2003 (FIG. 2), reads data from
accelerator engine 200 in linear order except in the IFFT mode where IFFT functions
leave data in bit-reverse order.
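
The request sequence of FIG. 8 can be sketched from the requesting side as below. Everything about the register layout here is hypothetical: the text names a run bit and a done bit in configuration register 316 but gives no bit positions, mode encoding, or addresses, so the sketch uses a software model (`accel_model`) in place of real hardware. On real hardware the register would have to be accessed through a `volatile` pointer.

```c
#include <stdint.h>

#define ACC_POINTS 128

/* Hypothetical bit assignments for the model of configuration register 316. */
enum { ACC_RUN = 1 << 0, ACC_DONE = 1 << 1, ACC_MODE_SHIFT = 8 };

typedef struct {
    uint32_t config;               /* models configuration register 316 */
    int32_t  rram[ACC_POINTS];     /* models RRAM 304 */
    int32_t  iram[ACC_POINTS];     /* models IRAM 308 */
} accel_model;

/* Stand-in for steps 820 and 828: a real engine would process the data here. */
static void accel_model_run(accel_model *acc)
{
    if (acc->config & ACC_RUN)
        acc->config |= ACC_DONE;
}

/* Request one function and read the data back into re/im. */
static void accel_invoke(accel_model *acc, uint32_t mode,
                         int32_t *re, int32_t *im)
{
    acc->config = mode << ACC_MODE_SHIFT;     /* step 804: set mode, run bit clear */

    for (int n = 0; n < ACC_POINTS; n++) {    /* steps 808 and 812: download 0..127, */
        acc->rram[n] = re[n];                 /* alternating RRAM and IRAM */
        acc->iram[n] = im[n];
    }

    acc->config |= ACC_RUN;                   /* step 816: set the run bit */
    accel_model_run(acc);                     /* step 820: engine works (modelled) */

    while (!(acc->config & ACC_DONE)) {       /* step 824: poll the done bit */
    }

    for (int n = 0; n < ACC_POINTS; n++) {    /* step 832: read results back */
        re[n] = acc->rram[n];
        im[n] = acc->iram[n];
    }
}
```

The model returns the data unchanged because the processing itself is outside the scope of the sketch; only the order of the handshake steps is illustrated.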

The invention has been explained above with reference to a preferred embodiment.
Other embodiments will be apparent to those skilled in the art after reading this disclosure.
Therefore, these and other variations upon the preferred embodiment are intended to be
covered by the appended claims.

WHAT IS CLAIMED IS:



1. A method for improving performance of an audio chip including a DSP, comprising the
steps of:
    providing an apparatus having a plurality of elements running in parallel with said
    DSP;
    configuring said apparatus to perform a function according to a configuration setup;
    and
    employing said apparatus for accessing data from said elements in a pipeline
    structure to maximize utilization of said elements.

2. The method of claim 1 wherein said function is usable in audio algorithms.

3. The method of claim 1 wherein said function is selected from a group consisting of
biquad filtering, double precision biquad filtering, IFFT, IDCT, pre-multiplication, and
post-multiplication.



4. The method of claim 1 wherein said plurality of elements includes:
    a first memory for storing real part data;
    a second memory for storing imaginary part data;
    a third memory for storing coefficient data;
    a multiplier for processing said real part data, said imaginary part data, and said
    coefficient data; and
    an ALU for processing said real part data, said imaginary part data, and said
    coefficient data.

5. The method of claim 1 wherein in a post-multiplication function, data is accessed in
bit-reverse order.

6. The method of claim 1 wherein data is accessed in a four-cycle pipeline structure in a
pre-multiplication function, in an IFFT function, and in a post-multiplication function; data
is accessed in a six-cycle pipeline structure in a biquad mode; and data is accessed in a
nine-cycle pipeline structure in a double precision biquad mode.

7. The method of claim 1 wherein performing a biquad function comprises the steps of:
    receiving N + 1 samples of data $x_n$ for n = m to n = m + N;
    storing data including said samples of data in memory locations in a predefined
    order; and
    calculating $y_n$ according to the equation
    $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 y_{n-1} + a_2 y_{n-2}$.

8. The method of claim 7 wherein said predefined order comprises: $y_{m-2}$ in a location K,
$y_{m-1}$ in a location K + 1, $x_{m-2}$ in a location K + 2, $x_{m-1}$ in a location K + 3, and $x_m$ to $x_{m+N}$ in
a location K + 4 through a location K + N + 4.

9. The method of claim 8 wherein the step of calculating $y_n$ comprises the steps of:
    (i) using values of $y_{m-2}$, $y_{m-1}$, $x_{m-2}$, $x_{m-1}$, and $x_m$ in respective locations K, K + 1,
    K + 2, K + 3, and K + 4 to calculate a $y_m$;
    (ii) storing said $y_m$ in said location K + 2;
    (iii) incrementing m by 1;
    (iv) incrementing K by 1; and
    (v) returning to step (i).

10. The method of claim 1 wherein performing a double precision biquad function
comprises the steps of:
    receiving N + 1 samples of data $x_n$ for n = m to n = m + N;
    storing data in memory locations in a predefined order; and
    calculating $y_n$ according to the equation
    $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 yl_{n-1} + a_2 yl_{n-2} + a_1 yh_{n-1} + a_2 yh_{n-2}$.

11. The method of claim 10 wherein said predefined order comprises: $yh_{m-2}$ in a location
K, $yh_{m-1}$ in a location K + 1, $yl_{m-2}$ in a location K + 2, $yl_{m-1}$ in a location K + 3, $x_{m-2}$ in a
location K + 4, $x_{m-1}$ in a location K + 5, and $x_m$ to $x_{m+N}$ in a location K + 6 through a
location K + N + 6.

12. The method of claim 10 wherein the step of calculating $y_n$ comprises the steps of:
    (i) using values of $yh_{m-2}$, $yh_{m-1}$, $yl_{m-2}$, $yl_{m-1}$, $x_{m-2}$, $x_{m-1}$, and $x_m$ in respective
    locations K, K + 1, K + 2, K + 3, K + 4, K + 5, and K + 6 to calculate a $y_m$;
    (ii) storing a $yh_m$ and a $yl_m$ of said $y_m$ in said locations K + 2 and K + 4,
    respectively;
    (iii) incrementing m by 1;
    (iv) incrementing K by 1; and
    (v) returning to step (i).

13. A method for performing a biquad function comprising the steps of:
    receiving N + 1 samples of data $x_n$ for n = m to n = m + N;
    storing data in memory locations in a predefined order; and
    calculating $y_n$ according to the equation
    $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 y_{n-1} + a_2 y_{n-2}$.

14. The method of claim 13 wherein said predefined order comprises: $y_{m-2}$ in a location K,
$y_{m-1}$ in a location K + 1, $x_{m-2}$ in a location K + 2, $x_{m-1}$ in a location K + 3, and $x_m$ to $x_{m+N}$ in a
location K + 4 through a location K + N + 4.

15. The method of claim 14 wherein the step of calculating $y_n$ comprises the steps of:
    (i) using values of $y_{m-2}$, $y_{m-1}$, $x_{m-2}$, $x_{m-1}$, and $x_m$ in respective locations K, K + 1,
    K + 2, K + 3, and K + 4 to calculate a $y_m$;
    (ii) storing said $y_m$ in said location K + 2;
    (iii) incrementing m by 1;
    (iv) incrementing K by 1; and
    (v) returning to step (i).

16. A method for performing a double precision biquad function comprising the steps of:
    receiving N + 1 samples of data $x_n$ for n = m to n = m + N;
    storing data including said samples of data in memory locations in a predefined
    order; and
    calculating $y_n$ according to the equation
    $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + a_1 yl_{n-1} + a_2 yl_{n-2} + a_1 yh_{n-1} + a_2 yh_{n-2}$.

17. The method of claim 16 wherein said predefined order comprises: $yh_{m-2}$ in a location
K, $yh_{m-1}$ in a location K + 1, $yl_{m-2}$ in a location K + 2, $yl_{m-1}$ in a location K + 3, $x_{m-2}$ in a
location K + 4, $x_{m-1}$ in a location K + 5, and $x_m$ to $x_{m+N}$ in a location K + 6 through a
location K + N + 6.

18. The method of claim 17 wherein the step of calculating $y_n$ comprises the steps of:
    (i) using values of $yh_{m-2}$, $yh_{m-1}$, $yl_{m-2}$, $yl_{m-1}$, $x_{m-2}$, $x_{m-1}$, and $x_m$ in respective
    locations K, K + 1, K + 2, K + 3, K + 4, K + 5, and K + 6 to calculate a $y_m$;
    (ii) storing a $yh_m$ and a $yl_m$ of said $y_m$ in said locations K + 2 and K + 4,
    respectively;
    (iii) incrementing m by 1;
    (iv) incrementing K by 1; and
    (v) returning to step (i).


ABSTRACT OF THE DISCLOSURE

An engine for processing functions used in audio algorithms. The engine runs in
parallel with a digital signal processor (DSP) in an audio chip to increase performance for
that chip. Functions performed by the engine include biquad filtering and inverse discrete
cosine transform (IDCT) including pre-multiplication, inverse Fast Fourier transform
(IFFT), and post-multiplication, which would otherwise be performed by the DSP. The
DSP is therefore free to perform other functions demanded by the chip. Resources in the
engine are used in a pipeline structure and are thus highly utilized. Data are stored in
a predefined order to increase efficiency.

/* FIG. 5C: address generation for the 128-point and 64-point FFT modes.
 * cr and ci are the real and imaginary parts of the complex constant C;
 * the listing does not show how they are fetched. */
Group = 1;
Block = FFT_Length / 2;                  /* 64 or 32 */
R2P = Log2(FFT_Length);                  /* 7 or 6 */
for (i = 0; i < R2P; i++)                /* radix 2 pass counter */
{
    Aiptr = 0;                           /* initialize A imaginary pointer */
    Arptr = 0;                           /* initialize A real pointer */
    Biptr = Block;                       /* initialize B imaginary pointer */
    Brptr = Block;                       /* initialize B real pointer */
    for (j = 0; j < Group; j++)
    {
        for (k = 0; k < Block; k++)
        {
            /* perform butterfly here */
            ar = *Arptr;                 /* fetch data */
            ai = *Aiptr;
            br = *Brptr;
            bi = *Biptr;

            rtemp = br * cr - bi * ci;   /* perform complex multiply */
            itemp = br * ci + bi * cr;

            *Arptr++ = ar - rtemp;       /* update and write back data */
            *Aiptr++ = ai - itemp;
            *Brptr++ = ar + rtemp;
            *Biptr++ = ai + itemp;
        }
        Aiptr += Block;                  /* update addresses to next group */
        Arptr += Block;
        Biptr += Block;
        Brptr += Block;
    }
    Block >>= 1;                         /* update block size for next radix 2 pass */
    Group <<= 1;                         /* update group size for next radix 2 pass */
}